BMC Bioinformatics — Latest Matching Preprints

1

MOPower: an R-shiny application for the simulation and power calculation of multi-omics studies.

Syed, H.; Otto, G. W.; Kelberman, D.; Bacchelli, C.; Beales, P. L.

2021-12-21 bioinformatics 10.1101/2021.12.19.473339 medRxiv

Top 0.1%

53.7%

Background: Multi-omics studies are increasingly used to help understand the underlying mechanisms of clinical phenotypes, integrating information from the genome, transcriptome, epigenome, metabolome, proteome and microbiome. This integration of data is of particular use in rare disease studies where the sample sizes are often relatively small. Methods development for multi-omics studies is in its early stages due to the complexity of the different individual data types. There is a need for software to perform data simulation and power calculation for multi-omics studies to test these different methodologies and help calculate sample size before the initiation of a study. This software, in turn, will optimise the success of a study. Results: The interactive R shiny application MOPower described below simulates data based on three different omics using statistical distributions. It calculates the power to detect an association with the phenotype through analysis of n replicates using a variety of the latest multi-omics analysis models and packages. The simulation study confirms the efficiency of the software when handling thousands of simulations over ten different sample sizes. The average time elapsed for a power calculation run between integration models was approximately 500 seconds. Additionally, for the given study design model, power varied as the number of features increased, and each method was affected differently. For example, MOFA showed an increase in power to detect an association when the study sample size matched the number of features. Conclusions: MOPower addresses the need for flexible and user-friendly software that undertakes power calculations for multi-omics studies. MOPower offers users a wide variety of integration methods to test and full customisation of omics features to cover a range of study designs.
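
The idea of estimating power by repeated simulation, as described in the abstract above, can be illustrated with a minimal sketch. This is not MOPower's code (MOPower is an R Shiny application); it is a single-feature Python illustration under assumed settings: one normally distributed feature with unit variance, a two-group phenotype, and a two-sided z-test. The names `estimate_power`, `effect_size` and `n_replicates` are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def estimate_power(n_per_group, effect_size, n_replicates=2000, alpha=0.05, seed=1):
    """Fraction of simulated studies in which a two-sided z-test rejects H0.

    Assumption: one normally distributed feature with unit variance in both
    groups; effect_size is the mean shift between cases and controls.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(2.0 / n_per_group)  # standard error of the mean difference
    rejections = 0
    for _ in range(n_replicates):
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        cases = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        z = (mean(cases) - mean(controls)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_replicates


if __name__ == "__main__":
    # Power across several candidate sample sizes for a fixed effect.
    for n in (10, 20, 40, 80):
        print(n, estimate_power(n, effect_size=0.5))
```

Running the loop over several sample sizes mirrors the kind of power-versus-sample-size curve the abstract describes, although a real multi-omics design would simulate many correlated features and use an integration model rather than a single z-test.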

2

CodAn: predictive models for the characterization of mRNA transcripts in Eukaryotes

Nachtigall, P. G.; Kashiwabara, A. Y.; Durham, A. M.

2019-10-07 bioinformatics 10.1101/794107 medRxiv

Top 0.1%

51.0%

Characterization of the coding sequences (CDSs) is an essential step in transcriptome annotation. Incorrect characterization of CDSs can lead to the prediction of non-existent proteins that can eventually compromise knowledge if databases are populated with similar incorrect predictions made in different genomes. Even though some recent methods have succeeded in correctly predicting the stop codon position in strand-specific sequences, prediction of the complete CDS is still far from a gold standard. More importantly, prediction in strand-blind sequences and in partial sequences is deficient, presenting very low accuracy. Here, we present CodAn, a new computational approach to predict CDS and UTR, that significantly pushes the boundaries of CDS prediction in strand-blind and in partial sequences, increases strand-specific full-CDS predictions and matches or surpasses gold-standard results in strand-specific stop codon predictions. CodAn is freely available for download at https://github.com/pedronachtigall/CodAn.

3

Unleashing alternative polyadenylation analyses with REPAC

Imada, E. L.; Wilks, C.; Langmead, B.; Marchionni, L.

2022-05-03 bioinformatics 10.1101/2022.03.14.484280 medRxiv

Top 0.1%

45.9%

Alternative polyadenylation (APA) is an important post-transcriptional mechanism that has major implications in biological processes and diseases. Although specialized sequencing methods for polyadenylation exist, their presence in public repositories is extremely limited when compared to traditional RNA-sequencing. To overcome this, we developed REPAC, a framework for the analysis of APA from RNA-sequencing data. REPAC implements a new method for detection of APA and is designed to take advantage of recount3 which enables a streamlined way to analyze over 750,000 publicly available samples. Using REPAC, we investigated the landscape of APA caused by activation of B cells. Our analysis revealed that during this process, hundreds of genes are regulated by APA, most notably genes involved in the secretion pathway which is central for the transition to antibody-secreting B-cells. Moreover, we also showed that many genes associated with interferon response are also shortened, suggesting that APA might also play a significant role in the immune response. We also show that REPAC is faster than alternative methods by at least 7-fold and that it scales well to analysis involving hundreds of samples. Overall, the REPAC method offers an accurate, easy, and convenient solution for the exploration of APA across many phenotypes.

4

SYNY: a pipeline to investigate and visualize collinearity between genomes

Julian, A. T.; Pombert, J.-F.

2024-05-13 bioinformatics 10.1101/2024.05.09.593317 medRxiv

Top 0.1%

42.0%

Investigating collinearity between chromosomes is often used in comparative genomics to help identify gene orthologs, pinpoint genes that might have been overlooked as part of annotation processes and/or perform various evolutionary inferences. Collinear segments, also known as syntenic blocks, can be inferred from sequence alignments and/or from the identification of genes arrayed in the same order and relative orientations between investigated genomes. To help perform these analyses and assess their outcomes, we built a simple pipeline called SYNY (for synteny) that implements the two distinct approaches and produces different visualizations. The SYNY pipeline was built with ease of use in mind and runs on modest hardware. The pipeline is written in Perl and Python and is available on GitHub (https://github.com/PombertLab/SYNY) under the permissive MIT license.

5

PISCES: a package for rapid quantitation and quality control of large scale mRNA-seq datasets

Shirley, M. D.; Radhakrishna, V. K.; Golji, J.; Korn, J. M.

2020-12-02 bioinformatics 10.1101/2020.12.01.390575 medRxiv

Top 0.1%

42.0%

PISCES eases processing of large mRNA-seq experiments by encouraging capture of metadata using simple textual file formats, processing samples on either a single machine or in parallel on a high performance computing cluster (HPC), validating sample identity using genetic fingerprinting, and summarizing all outputs in analysis-ready data matrices. PISCES consists of two modules: 1) compute cluster-aware analysis of individual mRNA-seq libraries including species detection, SNP genotyping, library geometry detection, and quantitation using salmon, and 2) gene-level transcript aggregation, transcriptional and read-based QC, TMM normalization and differential expression analysis of multiple libraries to produce data ready for visualization and further analysis. PISCES is implemented as a python3 package and is bundled with all necessary dependencies to enable reproducible analysis and easy deployment. JSON configuration files are used to build and identify transcriptome indices, and CSV files are used to supply sample metadata and to define comparison groups for differential expression analysis using DESeq2. PISCES builds on many existing open-source tools, and releases of PISCES are available on GitHub or the python package index (PyPI).

6

MenDEL: PCR Primer Design as Constrained Optimization Process

German, S.; Mitchell, L.; Vela Gartner, A.; Fenyo, D.; Boeke, J. D.

2022-06-29 bioinformatics 10.1101/2022.06.26.496474 medRxiv

Top 0.1%

41.2%

Motivation: The synthesis of large DNA assemblies has applications in biotechnology, and can help us better understand genome biology. These large DNA assemblies are often constructed from many smaller DNA segments, and it is critical to assess that they are correctly assembled. One low cost and rapid method to ensure that the connection between each segment is correct is to use PCR with primer pairs that span assembly junctions. However, the design of PCR primers for large assemblies consisting of multiple segments, and therefore containing multiple assembly junctions, is a challenging process. Rule-based automation of the process often results in finding primers that satisfy general criteria, but are not necessarily the best fit for every particular junction. Results: We have developed MenDEL, a web-based DNA design application that provides a primer pair computation tool for multiple assembly junctions in such a way that for each junction we automatically pick the optimal pair of primers based on user specified constraints. Availability and Implementation: The MenDEL application is available at https://mendel-isg.nyumc.org to registered users, and the code base for computing junction primers is available at https://github.com/MendelProject/PrimerOptimization.

7

ePat: extended PROVEAN annotation tool

Ito, T.; Yoshitake, K.; Iwata, T.

2021-12-23 bioinformatics 10.1101/2021.12.21.468911 medRxiv

Top 0.1%

39.7%

ePat (extended PROVEAN annotation tool) is a software tool that extends the functionality of PROVEAN, a software tool for predicting whether amino acid substitutions and indels will affect the biological function of proteins. ePat adds two capabilities that the conventional PROVEAN lacks. The first is to calculate and score the pathogenicity of indel mutations that cause a frameshift and of variants near splice junctions. The second is to use batch processing to calculate the pathogenicity of multiple variants in a variant list (VCF file) in a single step. ePat can help extract variants that affect biological functions by utilizing not only point mutations and indel mutations that do not cause a frameshift, but also frameshift, stop gain, and splice variants. These extended features should increase the detection rate and improve the diagnosis of inherited diseases or the association of specific variants with phenotypes.

8

XVCF: Exquisite Visualization of VCF Data from Genomic Experiments

Almuneef, G.; Aljouie, A.; Bokhari, Y.; Almazroa, A.; Rashid, M.

2025-05-06 bioinformatics 10.1101/2025.04.30.651450 medRxiv

Top 0.1%

39.6%

Background: High-throughput genomic analyses of germline and cancer genomes facilitate the identification of causal and actionable genetic variants. Recent advances in next-generation sequencing technology have generated large-scale genomic and/or multi-omics datasets from different disease types or models. Owing to the huge volume of data coming out of genomic experiments, scientists face challenges in handling, manipulating, visualizing, and interpreting the data. Currently available tools to visualize genetic variants from VCF files are not very user-friendly, as most of them require knowledge of command line tools or scripts to install and run the software, and many biologists and clinicians lack this programming knowledge. Therefore, graphical user interface (GUI) based tools are needed to effectively summarize and visualize the huge volume of genomic data such as VCF data. Methods: We have developed an interactive Shiny app in the R programming language that utilizes other R packages such as "vcfR" and "maftools" to visualize genetic data and generate quality control metrics for it. A key improvement is the addition of a user-friendly interface, providing researchers with an interactive way to explore VCF or cancer genomics data. Researchers can upload their datasets and customize the analysis parameters to suit their specific research needs. Results: A user-friendly interactive tool has been developed for the summarization and visualization of data related to genomic variation research. The application allows users to perform functions such as data loading, summarization, and visualization interactively. XVCF offers an easy-to-use GUI platform to read genetic variant data (annotated or unannotated) and extract useful information such as read depth, mapping quality, genotype, quality control summary, and allele frequency from unannotated data. In the second module of XVCF, cancer genomic data (annotated; so far only ANNOVAR annotation is supported) is analyzed using "maftools" to produce an oncoplot, a comparison of mutational load across different TCGA datasets, a gene summary, etc. XVCF is available for free download from https://github.com/rashidma/XVCF. Conclusion: The XVCF Shiny web application can serve as a robust visualization and quality control GUI platform for germline and cancer genomics datasets. We expect this tool will be immensely useful for researchers with less computational or technical knowledge. Being a Shiny R package, XVCF can be installed across different operating systems and utilize different computer hardware configurations.

9

Bayesian Analysis for 3D combinatorial CRISPR screens

Madenach, L.; Lohoff, C.

2022-06-02 bioinformatics 10.1101/2022.06.02.494493 medRxiv

Top 0.1%

39.6%

Combinatorial CRISPR screens are a well-established tool for the investigation of genetic interactions in a high-throughput fashion. Advances from 2D combinatorial CRISPR screens towards 3D combinatorial screens are currently being made, but an easy-to-use computational method for the analysis of 3D combinatorial screens is still missing. Here we propose a Bayesian analysis method for 3D CRISPR screens based on a well-established 2D CRISPR screen analysis protocol. With our tool we hope to provide researchers with an out-of-the-box analysis solution, avoiding the need for time-consuming and resource-intensive development of custom analysis protocols.

10

Varia: Prediction, analysis and visualisation of variable genes

Mackenzie, G.; Jensen, R.; Lavstsen, T.; Otto, T.

2020-12-16 bioinformatics 10.1101/2020.12.15.422815 medRxiv

Top 0.1%

39.6%

Assessing the diversity or expression of variable gene families in pathogens can inform about immune escape mechanisms or host interaction phenotypes of clinical relevance. However, obtaining the sequences and quantifying their expression is a challenge. Here, we present a tool which, based on unique sequence tag similarity between members of a gene family, predicts the domains encoded by the queried gene. As an example, we use the var gene family, encoding the major virulence proteins (PfEMP1) of the human malaria parasite, Plasmodium falciparum. We developed Varia, which predicts the likely var gene sequence and encoded protein domain composition of a gene from short sequence tags. We provide a new extended annotated var genome database, in which Varia identifies genes with identical tag sequences and compares these to return the most probable domain composition of the query gene. Varia's ability to predict correct PfEMP1 domain compositions from short var sequence tags was tested in two complementary pipelines: (a) to return the putative gene sequences and domain compositions of the query gene from any partial sequence provided, thereby enabling detailed assessment of specific genes' putative function and experimental validation of these; and (b) to accommodate rapid profiling of var gene expression in complex patient samples, by compiling the overall domain prevalence among var transcripts predicted, identified and quantified by next generation sequencing of so-called var DBL-sequence tags. Availability and implementation: Varia is available on GitHub (https://github.com/GCJMackenzie/Varia) under the MIT license. Contact: thomasl@sund.ku.dk, thomasdan.otto@glasgow.ac.uk

11

Bootstrap Evaluation of Association Matrices (BEAM) for Integrating Multiple Omics Profiles with Multiple Outcomes

Seffernick, A. E.; Cao, X.; Cheng, C.; Yang, W.; Autry, R. J.; Yang, J. J.; Pui, C.-H.; Teachey, D. T.; Lamba, J. K.; Mullighan, C. G.; Pounds, S. B.

2024-08-03 bioinformatics 10.1101/2024.07.31.605805 medRxiv

Top 0.1%

39.4%

Motivation: Large datasets containing multiple clinical and omics measurements for each subject motivate the development of new statistical methods to integrate these data to advance scientific discovery. Model: We propose bootstrap evaluation of association matrices (BEAM), which integrates multiple omics profiles with multiple clinical endpoints. BEAM associates a set of omic features with clinical endpoints via regression models and then uses bootstrap resampling to determine statistical significance of the set. Unlike existing methods, BEAM accommodates an arbitrary number of omic profiles and endpoints. Results: In simulations, BEAM performed similarly to the theoretically best simple test and outperformed other integrated analysis methods. In an example pediatric leukemia application, BEAM identified several genes with biological relevance established by a CRISPR assay that had been missed by univariate screens and other integrated analysis methods. Thus, BEAM is a powerful, flexible, and robust tool to identify genes for further laboratory and/or clinical research evaluation. Availability: Source code, documentation, and a vignette for BEAM are available on GitHub at: https://github.com/annaSeffernick/BEAMR. The R package is available from CRAN at: https://cran.r-project.org/package=BEAMR. Contact: Stanley.Pounds@stjude.org Supplementary Information: Supplementary data are available at the journal's website.
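
The general pattern of fitting a regression and then bootstrapping subjects to judge the stability of the association can be sketched as follows. This is not the BEAM algorithm or the BEAMR API: it is a one-feature, one-endpoint Python illustration with hypothetical function names, and it summarizes the bootstrap with a simple percentile interval, which is an assumption rather than the test the authors describe.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slopes(x, y, n_boot=1000, seed=0):
    """Slopes re-estimated on subjects resampled with replacement."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if max(xs) == min(xs):  # degenerate resample: slope undefined
            continue
        slopes.append(ols_slope(xs, [y[i] for i in idx]))
    return slopes


def percentile_interval(values, level=0.95):
    """Percentile bootstrap interval from a list of bootstrap estimates."""
    s = sorted(values)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return lo, hi


if __name__ == "__main__":
    rng = random.Random(7)
    feature = [float(i) for i in range(40)]            # toy omic feature
    endpoint = [0.5 * f + rng.gauss(0.0, 2.0) for f in feature]  # toy endpoint
    print(percentile_interval(bootstrap_slopes(feature, endpoint)))
```

An interval that excludes zero suggests the association is stable under resampling; extending this to several omic profiles and endpoints would require a matrix of coefficients per bootstrap sample rather than a single slope.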

12

Inferring copy number variation from gene expression data: methods, comparisons, and applications to oncology

Boen, J.; Wagner, J. P.; Di Nanni, N.

2021-10-19 bioinformatics 10.1101/2021.10.18.463991 medRxiv

Top 0.1%

38.9%

Copy number variations (CNVs) are genomic events where the number of copies of a particular gene varies from cell to cell. Cancer cells are associated with somatic CNV changes resulting in gene amplifications and gene deletions. However, short of single-cell whole-genome sequencing, it is difficult to detect and quantify CNV events in single cells. In contrast, the rapid development of single-cell RNA sequencing (scRNA-seq) technologies has enabled easy acquisition of single-cell gene expression data. In this work, we employ three methods to infer CNV events from scRNA-seq data and provide a statistical comparison of the methods' results. In addition, we combine the analysis of scRNA-seq and inferred CNV data to visualize and determine subpopulations and heterogeneity in tumor cell populations.

13

ProteoSync, a program for ortholog selection, automated sequence alignment and conservation projection onto protein atomic coordinates

Sicheri, E.; Mao, D.; Sicheri, F.

2025-03-14 bioinformatics 10.1101/2025.03.09.642228 medRxiv

Top 0.1%

38.8%

The projection of conservation onto the surface of a protein's 3D structure is a powerful way of inferring functionally important regions. At present, the workflow for doing so can be involved and tedious. For this reason, we created ProteoSync, a Python program that semi-automates the process. The program creates an annotated sequence alignment of orthologs from a diverse set of selectable species, and enables the fast projection of amino acid conservation onto a predicted or known 3D model in PyMOL[1].

14

ipADMIXTURE: R package for inferring sub-population clusters based on genetic admixture

Amornbunchornvej, C.; Wangkumhang, P.; Tongsima, S.

2020-03-23 bioinformatics 10.1101/2020.03.21.001206 medRxiv

Top 0.1%

38.8%

ipADMIXTURE is an R package to infer clusters and their phylogeny based on Q matrices of genetic admixture analysis. It is the first software of its kind to infer not only clusters, but also the hierarchy of sub-populations w.r.t. the minimum number of ancestors that split any pair of clusters apart. Since the inputs of the package, Q matrices, can be obtained from well-known software (ADMIXTURE, STRUCTURE, etc.) and are mandatory information in genetic population structure studies, our package has the potential to help scientists and researchers find deeper explanations of admixture analysis in their studies. Our package comes with a user-friendly interface to make the software accessible to everyone.

15

Geneplot: a coordinate conversion approach for graphical representation of protein domain data on the exon-intron structure of a gene.

Gonzalez-Ibeas, D.

2022-11-09 bioinformatics 10.1101/2022.11.08.513416 medRxiv

Top 0.1%

38.6%

Graphical representation of single gene data, including subgenic features, polymorphisms and protein domains, is part of the regular routine of genome analyses. In the case of protein-coding genes, integration of such information with the exon-intron structure has advantages since intron polymorphisms may also have a biological impact, and the extent to which exons and protein domains overlap is of interest to evolutionary research. This report introduces geneplot, an open-source Python library to generate this type of graphical output from standard file formats. The library applies a coordinate conversion approach in order to represent protein domain data on genomic areas.
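
A coordinate conversion of this kind can be sketched in a few lines. The following is not geneplot's API; it is a minimal Python illustration under stated assumptions: the gene is on the plus strand, `cds_exons` lists the coding parts of exons as 1-based inclusive genomic intervals in transcript order, and the function name `protein_to_genomic` is hypothetical.

```python
def protein_to_genomic(aa_start, aa_end, cds_exons):
    """Map a protein interval (1-based, inclusive) to genomic segments.

    Assumptions: plus-strand gene; cds_exons holds (start, end) genomic
    coordinates of coding exon parts, 1-based inclusive, in transcript order.
    A domain that spans an intron is returned as several segments.
    """
    nt_start = 3 * (aa_start - 1) + 1  # first CDS nucleotide of the interval
    nt_end = 3 * aa_end                # last CDS nucleotide of the interval
    segments = []
    consumed = 0  # CDS nucleotides contained in exons already passed
    for exon_start, exon_end in cds_exons:
        exon_len = exon_end - exon_start + 1
        first = consumed + 1           # CDS coordinate of this exon's first base
        last = consumed + exon_len     # CDS coordinate of this exon's last base
        lo = max(nt_start, first)
        hi = min(nt_end, last)
        if lo <= hi:                   # the interval overlaps this exon
            segments.append((exon_start + lo - first, exon_start + hi - first))
        consumed = last
    return segments


if __name__ == "__main__":
    # Two coding exons: 9 nt (residues 1-3) and 12 nt (residues 4-7).
    exons = [(101, 109), (201, 212)]
    print(protein_to_genomic(2, 5, exons))
```

In the toy gene above, a domain covering residues 2 to 5 is split by the intron into one segment per exon, which is the situation where plotting domains on the exon-intron structure becomes informative. A minus-strand gene would need the exon order and offsets mirrored.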

16

art_modern: An Accelerated ART Simulator of Diverse Next-Generation Sequencing Reads

YU, Z.

2026-02-23 bioinformatics 10.64898/2026.02.20.707060 medRxiv

Top 0.1%

38.5%

Summary: Fast simulation of next-generation sequencing (NGS) data is vital for software development and benchmarking. Here we describe art_modern, an accelerated ART simulator that can simulate various NGS data. We accelerated ART using updated sampling algorithms, single-instruction multiple-data (SIMD) instruction-set extensions (ISEs), thread- and node-level parallelism, and an asynchronous output writer, while enabling simulation of transcriptome profiling data by supporting contig-specific coverage with strand information. The new implementation was benchmarked against popular performance-oriented NGS simulators, revealing a 75-77% reduction in CPU time and a 15- to 24-fold acceleration in wall-clock time on a multi-core machine compared to the original implementation. With this simulator, the process of developing and benchmarking NGS sequence analysis algorithms can be largely accelerated. Availability and Implementation: The software is implemented in C++17 with CMake as the build system. It can be built and executed on a modern GNU/Linux operating system with Boost, Zlib, and a C++17 compiler, with further acceleration available using Intel oneAPI C++/DPC++ compilers and Intel oneAPI MKL random generators. The software is available at https://github.com/YU-Zhejian/art_modern under the GNU General Public License v3. Contact: Zhejian Yu (yuzj25@seas.upenn.edu)

17

ZARP: An automated workflow for processing of RNA-seq data

Katsantoni, M.; Gypas, F.; Herrmann, C. J.; Burri, D.; Bak, M.; Iborra, P.; Agarwal, K.; Ataman, M.; Boersch, A.; Zavolan, M.; Kanitz, A.

2021-11-19 bioinformatics 10.1101/2021.11.18.469017 medRxiv

Top 0.1%

38.0%

RNA sequencing (RNA-seq) is a crucial technique for many scientific studies, and multiple models and software packages have been developed for the processing and analysis of such data. Given the plethora of available tools, choosing the most appropriate ones is a time-consuming process that requires an in-depth understanding of the data, as well as of the principles and parameters of each tool. In addition, packages designed for individual tasks are developed in different programming languages and have dependencies of various degrees of complexity, which renders their installation and execution challenging for users with limited computational expertise. The use of workflow languages and execution engines with support for virtualization and encapsulation options such as containers and Conda environments facilitates these tasks considerably. Computational workflows defined in those languages can be reliably shared with the scientific community, enhancing reusability, while improving reproducibility of results by making individual analysis steps more transparent. Here we present ZARP, a general purpose RNA-seq analysis workflow which builds on state-of-the-art software in the field to facilitate the analysis of RNA-seq data sets. ZARP is developed in the Snakemake workflow language using best software development practices. It can run locally or in a cluster environment, generating extensive reports not only of the data but also of the options utilized. It is built using modern technologies with the ultimate goal of reducing the hands-on time for bioinformaticians and non-expert users. ZARP is available under a permissive Open Source license and open to contributions by the scientific community. Contact: mihaela.zavolan@unibas.ch, alexander.kanitz@unibas.ch

18

Evidence-driven biases in alternative splicing inferred from NCBI Eukaryotic Genome Annotation Pipeline metadata

de la Fuente, R.; Diaz-Villanueva, W.; Arnau, V.; Moya, A.

2025-05-31 bioinformatics 10.1101/2025.05.27.656353 medRxiv

Top 0.1%

35.6%

The NCBI Eukaryotic Genome Annotation Pipeline (EGAP) predicts coding sequences by integrating transcriptomic and proteomic data with computational approaches, providing structural information that can be used to infer alternative transcript isoforms. However, accurate estimation of alternative splicing events depends on high-quality genome annotations, particularly in genome-wide analyses. The extent to which annotation pipelines influence these inferred splicing patterns has remained largely unexplored. In this study, we quantify potential biases associated with the EGAP annotation pipeline, and find that specific annotation features strongly influence these estimates, particularly the percentage of coding sequences that are supported by experimental evidence. Further, we implemented a polynomial regression model to normalize splicing levels, generating an adjusted metric that minimizes evidence-driven biases. This framework may serve as a basis for future investigations into splicing complexity and comparative genomics.
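
One way to picture a polynomial-regression normalization of this sort is to fit the splicing level against the evidence-support fraction and keep what the fit does not explain. The sketch below is an assumption-laden illustration, not the authors' model: the abstract does not state the polynomial degree, the exact predictors, or how the adjusted metric is derived, so the quadratic default, the use of residuals, and the names `polyfit` and `adjust` are all hypothetical.

```python
def polyfit(x, y, degree=2):
    """Least-squares polynomial coefficients [c0, c1, ...] via normal equations."""
    m = degree + 1
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [
        [sum(xi ** (i + j) for xi in x) for j in range(m)]
        + [sum(yi * xi ** i for xi, yi in zip(x, y))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


def adjust(x, y, degree=2):
    """Residuals of y after removing the fitted polynomial trend in x.

    Assumption: x is the fraction of evidence-supported coding sequences per
    genome and y is the raw splicing level; the residual is treated here as
    the bias-adjusted value.
    """
    coeffs = polyfit(x, y, degree)
    fitted = [sum(c * xi ** k for k, c in enumerate(coeffs)) for xi in x]
    return [yi - fi for yi, fi in zip(y, fitted)]


if __name__ == "__main__":
    support = [0.1 * i for i in range(11)]  # toy evidence-support fractions
    splicing = [1 + 2 * s + 3 * s ** 2 for s in support]  # toy raw levels
    print(polyfit(support, splicing))
```

Solving the normal equations directly is adequate for a low-degree fit on a predictor bounded between 0 and 1; a production analysis would more likely rely on an established numerical library.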

19

a2iHelper: a Python toolkit for a differential editing site analysis of RNA-Seq data

Ribas, G. T.; Guizelini, D.; Riella, L. V.; Riella, C. V.

2024-10-18 bioinformatics 10.1101/2024.10.15.618547 medRxiv

Top 0.1%

35.5%

Background: A-to-I RNA editing, mediated by ADAR enzymes, plays a crucial role in cancer and autoimmune disorders but lacks standardized tools for differential analysis. A review of 55 studies highlights significant methodological heterogeneity, which hinders the comparability and reproducibility of results. To address this, we developed a2iHelper, a Python package that streamlines RNA editing analysis by filtering noise, performing statistical analyses, and generating visualizations. a2iHelper integrates seamlessly with Python machine learning tools, aiming to standardize and simplify RNA editing research. Results: We applied our methods to analyze A-to-I editing in a public dataset comparing wild-type and ADAR knockout. The source code is open, freely available on GitHub, and organized in a well-documented Python package. Using Snakemake for preprocessing, we conducted differential editing analysis on the top 104 most edited genes. The results include p-values from statistical tests, odds ratios for Manhattan plots, and correlations between editing frequencies and gene expression, visualized in various plots. Conclusions: We developed a2iHelper, a Python-based package for analyzing and visualizing A-to-I RNA editing data. It allows researchers with minimal programming experience to perform organized, reproducible editing analyses. Novice users can easily detect editing sites, filter noise, and generate figures, while advanced users can integrate functionalities into their workflows. a2iHelper runs on personal computers without needing High-Performance Computing resources.
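
As a rough illustration of the odds ratios mentioned above, the following sketch compares edited and unedited read counts at one site between two conditions. It is not a2iHelper's API: the function name, the optional pseudocount, and the Wald-type confidence interval on the log odds ratio are assumptions chosen for a compact example.

```python
import math


def editing_odds_ratio(edited_a, unedited_a, edited_b, unedited_b, pseudocount=0.5):
    """Odds ratio of editing in condition A versus B, with a 95% Wald interval.

    Counts are reads supporting the edited (G) and unedited (A) base at one
    site. The pseudocount (an assumption) avoids division by zero when a
    cell of the 2x2 table is empty.
    """
    a = edited_a + pseudocount
    b = unedited_a + pseudocount
    c = edited_b + pseudocount
    d = unedited_b + pseudocount
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Toy site: 30 of 100 reads edited in wild-type, 10 of 100 in knockout.
    print(editing_odds_ratio(30, 70, 10, 90))
```

Plotting the log of such per-site odds ratios along genomic coordinates gives a Manhattan-style view of where editing differs between conditions.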

20

yQTL Pipeline: a structured computational workflow for large scale quantitative trait loci discovery and downstream visualization

Li, M.; Song, Z.; Gurinovich, A.; Schork, N.; Sebastiani, P.; Monti, S.

2024-01-30 bioinformatics 10.1101/2024.01.26.577518 medRxiv

Top 0.1%

35.1%

Quantitative trait loci (QTL) denote regions of DNA whose variation is associated with variations in quantitative traits. QTL discovery is a powerful approach to understand how changes in molecular and clinical phenotypes may be related to DNA sequence changes. However, QTL discovery analysis encompasses multiple analytical steps and the processing of multiple input files, which can be laborious, error prone, and hard to reproduce if performed manually. In order to facilitate and automate large-scale QTL analysis, we developed the yQTL Pipeline, where the y indicates the dependent quantitative variable being modeled. Prior to the genome-wide association test, the pipeline supports the calculation or the direct input of pre-defined genome-wide principal components and a genetic relationship matrix when applicable. User-specified covariates can also be provided. Depending on whether familial relatedness exists among the subjects, genome-wide association tests are performed using either a linear mixed-effect model or a linear model. Using the workflow management tool Nextflow, the pipeline parallelizes the analysis steps to optimize run time and ensure reproducibility of results. In addition, a user-friendly R Shiny App was developed to facilitate result visualization. Upon uploading the result file, it can generate Manhattan plots of user-selected phenotype traits and trait-QTL connection networks based on user-specified p-value thresholds. We applied the yQTL Pipeline to analyze metabolomics profiles of blood serum from the New England Centenarians Study (NECS) participants. A total of 9.1M SNPs and 1,052 metabolites across 194 participants were analyzed. Using a p-value cutoff of 5e-8, we found 14,983 mQTLs cumulatively associated with 312 metabolites. The built-in parallelization of our pipeline reduced the run time from ~90 min to ~26 min. Visualization using the R Shiny App revealed multiple mQTLs shared across multiple metabolites. The yQTL Pipeline is available with documentation on GitHub at https://github.com/montilab/yQTL-Pipeline.